The neural style transfer on portraits project takes two images, a content image (specifically a human portrait) and a style reference image (such as an artwork by a famous painter), and uses an optimization algorithm to convert the human's selfie into a new, artistically processed portrait.
The target portrait should keep the objects and their placements from the content image, while adopting the visual style, colors, shades, and textures of the style image.
The flow chart below briefly describes the procedure we follow in this project.
To reach the desired portrait result, two essential models need to be mentioned: MODNet and VGG-19.
MODNet (Matting Objective Decomposition Network) is a lightweight matting network that can perform portrait matting on a single input image in real time. We employ it to remove the background information from the picture, because background clutter makes it considerably harder to pick a relatively consistent set of weights and layers that works on most portrait images.
On the other hand, a well-known pretrained convolutional neural network (CNN), VGG-19, is used to extract features from both the content image and the style image. It is particularly important in the later steps, because it lets us transfer the style by defining a loss function that minimizes the differences between the features of our content, style, and target images.
Even though neural style transfer can be applied to many types of content images, we are motivated by the uniquely human-centered character that portrait images carry.
Restricting ourselves to portraits not only lets us take a deeper dive into the nuances of weights, layers, and picture background to reach a relatively stable hyperparameter choice by eliminating the overwhelming variety of content image types, but also has the potential to move people from simply appreciating neural style transfer to actually using it. The curiosity of seeing one's self-portrait in different styles, together with the natural urge to share, encourages people to use this technique and post the results in many settings, such as social platforms, family gatherings, and festival events. Neural style transfer on portraits fits all of these scenes and can produce the positive outcome of increased social interaction.
!git clone https://github.com/thiagoambiel/PortraitStylization.git
%cd /content/PortraitStylization
Here we clone the GitHub repository that provides the BackgroundRemoval package and install its dependencies.
%cd /content/PortraitStylization
!pip install -r requirements.txt
%reload_ext autoreload
%autoreload
import io
from torch import nn
import torch
import numpy as np
import torch.optim as optim
from torchvision import transforms, models
from PIL import Image, ImageColor
import matplotlib.pyplot as plt
from ipywidgets import widgets, interact
from IPython.display import display, HTML
#from style_transfer import StyleTransfer
from remove_bg import BackgroundRemoval
Here we use the "features" portion of VGG-19 as the feature extractor for our style transfer. Note that we want the weights fixed, so we stop updating the current parameters.
vgg = models.vgg19(pretrained=True).features
#stop updating the current parameters
for param in vgg.parameters():
    param.requires_grad_(False)
# we move the model to GPU, if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# the vgg19 structure is displayed below
vgg.to(device)
We first build the functions load_image and load_image_uploader, which load the images we need.
def load_image(img_path, shape=None):
    ''' Load in and transform an image, making sure the image
        is <= 256 pixels in the x-y dims.'''
    image = Image.open(img_path).convert('RGB')
    # large images will slow down processing
    if max(image.size) > 256:
        size = 256
    else:
        size = max(image.size)
    if shape is not None:
        size = shape
    in_transform = transforms.Compose([
        transforms.Resize(size),
        transforms.ToTensor(),
        transforms.Normalize((0.485, 0.456, 0.406),
                             (0.229, 0.224, 0.225))])
    # discard the transparent alpha channel (that's the :3) and add the batch dimension
    image = in_transform(image)[:3, :, :].unsqueeze(0)
    return image.to(device)
def load_image_uploader(image):
    ''' Load in and transform an already-opened PIL image, making sure
        the image is <= 256 pixels in the x-y dims.'''
    image = image.convert('RGB')
    # large images will slow down processing
    if max(image.size) > 256:
        size = 256
    else:
        size = max(image.size)
    in_transform = transforms.Compose([
        transforms.Resize(size),
        transforms.ToTensor(),
        transforms.Normalize((0.485, 0.456, 0.406),
                             (0.229, 0.224, 0.225))])
    # discard the transparent alpha channel (that's the :3) and add the batch dimension
    image = in_transform(image)[:3, :, :].unsqueeze(0)
    return image.to(device)
Then we load the content image and the style images.
!wget https://oceanmhs.org/wp-content/uploads/2018/01/starrynight.jpg #download style1
class ImageUploader:
    def __init__(self):
        self.data = []
        self.output = widgets.Output()
        self.uploader = widgets.FileUpload()

    def save(self, _):
        with self.output:
            for name, file_info in self.uploader.value.items():
                img = Image.open(io.BytesIO(file_info['content']))
                self.data.append(img)

    def run(self):
        display(self.output, self.uploader)
        self.uploader.observe(self.save, names='_counter')
content_uploader = ImageUploader()
content_uploader.run()
plt.imshow(content_uploader.data[0])
# try vgg19 without background removal
content_image = content_uploader.data[0]
Here we use the render function and the background removal package to isolate the portrait and replace the background information with solid black.
original_image = content_uploader.data[0]
def render(bgcolor, fgcolor, fg_fac, bt_fac, image):
    # uses the module-level background_removal, alpha, and result_data defined below
    result = background_removal.remove_background(
        img=image,
        alpha=alpha,
        bg_color=bgcolor,
        bt_fac=bt_fac,
        fg_color=fgcolor,
        fg_fac=fg_fac
    )
    result_data.clear()
    result_data.insert(0, result)
background_removal = BackgroundRemoval(weights_path="./weights/modnet.pth", device=device)
alpha = background_removal.gen_alpha(np.array(original_image))
result_data = []
render(
bgcolor='#000000',
fgcolor='#ffffff',
fg_fac=0, #ForeFac 0-1
bt_fac=1, #TextureFac 0-1
image = original_image
)
content_image = result_data[0]
plt.imshow(content_image)
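Conceptually, the step above composites the matted foreground over a solid background: result = alpha * foreground + (1 - alpha) * bg_color. The following is a hypothetical plain-NumPy sketch of that compositing, not the PortraitStylization API:

```python
import numpy as np

def composite(img, alpha, bg_color=(0, 0, 0)):
    """Blend img over a solid bg_color using an alpha matte in [0, 1]."""
    img = img.astype(np.float32)
    alpha = alpha[..., None].astype(np.float32)  # HxW -> HxWx1 for broadcasting
    bg = np.broadcast_to(np.array(bg_color, np.float32), img.shape)
    return (alpha * img + (1 - alpha) * bg).astype(np.uint8)

# tiny hypothetical example: uniform gray image, matte keeps two corners
img = np.full((2, 2, 3), 200, np.uint8)
alpha = np.array([[1.0, 0.0], [0.5, 1.0]])
out = composite(img, alpha)
print(out[0, 0], out[0, 1])  # [200 200 200] [0 0 0]
```

Fully opaque pixels keep the portrait, fully transparent ones become the background color (black here), and fractional alpha blends the two, which is what produces clean edges around the hair.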
content = load_image_uploader(content_image)
original = load_image_uploader(original_image)
# Resize style images to match content
style_starry = load_image('starrynight.jpg', shape=content.shape[-2:]).to(device)
style_leaf = load_image('style_leaf.jpg', shape=content.shape[-2:]).to(device)
style_marble = load_image('style_marble.jpg', shape=content.shape[-2:]).to(device)
style_pattern5 = load_image('style_pattern5.jpg', shape=content.shape[-2:]).to(device)
Now we double-check that the content and style images have matching shapes.
print(content.shape, style_starry.shape)
# helper function for un-normalizing an image
# and converting it from a Tensor image to a NumPy image for display
def im_convert(tensor):
    """ Display a tensor as an image. """
    image = tensor.to("cpu").clone().detach()
    image = image.numpy().squeeze()
    image = image.transpose(1, 2, 0)
    image = image * np.array((0.229, 0.224, 0.225)) + np.array((0.485, 0.456, 0.406))
    image = image.clip(0, 1)
    return image
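As a quick sanity check of the un-normalization, the sketch below copies the helper and feeds it a synthetic mid-gray image normalized the same way transforms.Normalize does; the helper should recover the original 0.5 values in HWC layout (the 4x4 size is arbitrary):

```python
import torch
import numpy as np

def im_convert(tensor):
    """Copy of the notebook's helper: un-normalize and convert to an HWC NumPy image."""
    image = tensor.to("cpu").clone().detach()
    image = image.numpy().squeeze()
    image = image.transpose(1, 2, 0)
    image = image * np.array((0.229, 0.224, 0.225)) + np.array((0.485, 0.456, 0.406))
    return image.clip(0, 1)

# normalize a constant mid-gray image with the same ImageNet statistics
gray = torch.full((1, 3, 4, 4), 0.5)
mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)
out = im_convert((gray - mean) / std)
print(out.shape)  # (4, 4, 3) -- channels moved last, batch dim squeezed away
```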
After defining the conversion helper, we display the images.
# display the images
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(20, 20))
# content and style ims side-by-side
ax1.imshow(im_convert(content))
ax1.set_title("Content Image")
ax2.imshow(im_convert(style_starry))
ax2.set_title("Style Image 1")
ax3.imshow(im_convert(original))
ax3.set_title("Original Image")
plt.show()
Here we specify the content layer and style layers in the get_features function. The reasons for choosing these layers as the content and style representations can be found in the summary.
def get_features(image, model, layers=None):
    """ Run an image forward through a model and get the features for
        a set of layers. Default layers are for VGGNet matching Gatys et al (2016)
    """
    ## Need the layers for the content and style representations of an image
    if layers is None:
        layers = {'0': 'conv1_1',
                  '5': 'conv2_1',
                  '10': 'conv3_1',
                  '19': 'conv4_1',
                  '25': 'conv4_2',  ## content representation
                  '28': 'conv5_1'}
    features = {}
    x = image
    # model._modules is a dictionary holding each module in the model
    for name, layer in model._modules.items():
        x = layer(x)
        if name in layers:
            features[layers[name]] = x
    return features
style_layers, content_layers = [0, 5, 10, 19, 28], [25]
#reference: https://blog.csdn.net/qq_39906884/article/details/124658508
def extract_features(X, content_layers, style_layers):
    contents = []
    styles = []
    for i in range(len(vgg)):
        X = vgg[i](X)
        if i in style_layers:
            styles.append(X)
        if i in content_layers:
            contents.append(X)
    return contents, styles
contents_Y = extract_features(content, content_layers, style_layers)[0]
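The index-based extraction loop can be exercised on a tiny stand-in network; this is a hypothetical two-convolution Sequential, not VGG-19, just to show how layer indices select intermediate activations:

```python
import torch
from torch import nn

# hypothetical stand-in for vgg: an indexable Sequential
net = nn.Sequential(nn.Conv2d(3, 4, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(4, 8, 3, padding=1), nn.ReLU())

def extract_features(X, content_layers, style_layers):
    contents, styles = [], []
    for i in range(len(net)):
        X = net[i](X)          # forward through one module at a time
        if i in style_layers:
            styles.append(X)   # keep this activation as a style feature
        if i in content_layers:
            contents.append(X) # keep this activation as a content feature
    return contents, styles

x = torch.randn(1, 3, 8, 8)
contents, styles = extract_features(x, content_layers=[2], style_layers=[0, 3])
print(len(contents), len(styles))  # 1 2
print(styles[0].shape)             # torch.Size([1, 4, 8, 8])
```

The same pattern works on the real `vgg` because `torchvision`'s `features` module is also an indexable `nn.Sequential`.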
Here the Gram matrix (also called the style matrix) computes the inner products between pairs of flattened feature maps in a given CNN layer, capturing the "distribution of features" while discarding their spatial arrangement.
def gram_matrix(tensor):
    """ Calculate the Gram Matrix of a given tensor
        Gram Matrix: https://en.wikipedia.org/wiki/Gramian_matrix
    """
    # get the batch_size, depth, height, and width of the Tensor
    _, d, h, w = tensor.size()
    # reshape so we're multiplying the features for each channel
    tensor = tensor.view(d, h * w)
    # calculate the gram matrix
    gram = torch.mm(tensor, tensor.t())
    return gram
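For intuition, here is the same computation worked out on a hypothetical 1x2x2x2 feature map (two channels, 2x2 spatial); entry (i, j) of the Gram matrix is the inner product of flattened channels i and j:

```python
import torch

# hypothetical feature map: batch 1, 2 channels, 2x2 spatial
feat = torch.tensor([[[[1., 2.], [3., 4.]],
                      [[0., 1.], [1., 0.]]]])
_, d, h, w = feat.size()
flat = feat.view(d, h * w)   # (2, 4): channel 0 = [1,2,3,4], channel 1 = [0,1,1,0]
gram = torch.mm(flat, flat.t())  # (2, 2)
print(gram)
# gram[0,0] = 1+4+9+16 = 30, gram[0,1] = 0+2+3+0 = 5, gram[1,1] = 0+1+1+0 = 2
```

The diagonal measures how strongly each channel fires; the off-diagonal measures how often two channels fire together, which is the correlation information the style loss tries to match.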
def style_transfer(content, style, vgg, sw1, sw2, sw3, sw4, sw5, style_weight, epochs):
    # get content and style features only once before training
    content_features = get_features(content, vgg)
    style_features = get_features(style, vgg)
    # calculate the gram matrices for each layer of our style representation
    style_grams = {layer: gram_matrix(style_features[layer]) for layer in style_features}
    # create a third "target" image and prep it for change;
    # start with a copy of the *content* image, add some noise,
    # then iteratively change its style
    # changed: noise added to the input image
    target = content.clone().to(device)
    random_img = torch.randn(content.data.size()).to(device)  # .to(device) so this also runs on CPU
    target = 0.6 * target + 0.4 * random_img
    target.requires_grad_(True)
    # weights for each style layer
    style_weights = {'conv1_1': sw1,
                     'conv2_1': sw2,
                     'conv3_1': sw3,
                     'conv4_1': sw4,
                     'conv5_1': sw5}
    content_weight = 1           # alpha
    style_weight = style_weight  # beta
    # changed: show_every, lr in the optimizer, num_epochs,
    # the content loss computation, the style loss calculation (MSE loss),
    # and the way content loss and style loss are combined
    # for displaying the target image, intermittently
    show_every = 1000
    # iteration hyperparameters
    optimizer = optim.Adam([target], lr=0.003)
    num_epochs = epochs  # how many iterations to update the image (e.g. 5000)
    for ii in range(1, num_epochs + 1):
        # get the features from the target image
        target_features = get_features(target, vgg)
        # the content loss (MSE against the module-level contents_Y computed earlier)
        contents_Y_hat, styles_Y_hat = extract_features(target, content_layers, style_layers)
        content_loss = [torch.nn.MSELoss(reduction='mean')(Y_hat, Y.detach()) * content_weight
                        for Y_hat, Y in zip(contents_Y_hat, contents_Y)]
        # the style loss, accumulated over each layer's gram-matrix MSE
        style_loss = 0
        for layer in style_weights:
            # get the "target" style representation for the layer
            target_feature = target_features[layer]
            target_gram = gram_matrix(target_feature)
            _, d, h, w = target_feature.shape
            # get the "style" style representation
            style_gram = style_grams[layer]
            # the style loss for one layer, weighted appropriately
            layer_style_loss = style_weights[layer] * torch.nn.MSELoss(reduction='mean')(target_gram, style_gram)
            # add to the style loss, normalized by the feature-map size
            style_loss += layer_style_loss / (d * h * w)
        # calculate the *total* loss
        total_loss = content_loss[0] + style_weight * style_loss
        # update the target image
        total_loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        # display intermediate images and print the loss
        if ii % show_every == 0:
            print('epoch:', ii, 'Total loss:', total_loss.item())
            plt.imshow(im_convert(target))
            plt.show()
#@markdown Adjust Weights for Better Results.
#@markdown Weights for each style layer
first_layer = 0.7 #@param {type: "slider", min: 0.0, max: 1.0, step: 0.1}
second_layer = 0.5 #@param {type: "slider", min: 0.0, max: 1.0, step: 0.1}
third_layer = 0.2 #@param {type: "slider", min: 0.0, max: 1.0, step: 0.1}
fourth_layer = 0.3 #@param {type: "slider", min: 0.0, max: 1.0, step: 0.1}
fifth_layer = 0.1 #@param {type: "slider", min: 0.0, max: 1.0, step: 0.1}
#@markdown Overall weight of style image
style_weight = 1000 #@param {type: "number"}
#@markdown number of training epochs
epochs = 7000 #@param {type: "slider", min: 5000, max: 12000, step: 1000}
style_transfer(content, style_starry, vgg,first_layer, second_layer, third_layer, fourth_layer, fifth_layer, style_weight, epochs)
As more epochs are processed, we can see that the generated image adopts the visual style and gains more texture detail, while the total loss drops from 3257 to 193.
However, note that for neural style transfer on portraits, judging from these outputs at different epoch counts, a total loss close to zero does not necessarily mean a better style transfer, since there is a risk of over-transferring and distorting the objects of the content image.
Thus, tuning the hyperparameters is a relatively subjective process, because we focus more on the actual output image than on the total loss.
style_transfer(content, style_leaf, vgg,first_layer, second_layer, third_layer, fourth_layer, fifth_layer, style_weight, epochs)
style_transfer(content, style_marble, vgg,first_layer, second_layer, third_layer, fourth_layer, fifth_layer, style_weight, epochs)
style_transfer(content, style_pattern5, vgg,first_layer, second_layer, third_layer, fourth_layer, fifth_layer, style_weight, epochs)
Most style transfer pipelines use the VGG-19 model alone. However, the background information in a portrait picture can be a huge distraction, so we now compare the plain VGG-19 model (without background removal) against our model (VGG-19 with background removal).
style_transfer(original, style_starry, vgg,first_layer, second_layer, third_layer, fourth_layer, fifth_layer, style_weight, epochs)
style_transfer(original, style_marble, vgg,first_layer, second_layer, third_layer, fourth_layer, fifth_layer, style_weight, epochs)
From the output above, we notice that for the transferred images without background removal, the backgrounds of the portraits are dirty and messy, and the out-of-focus regions of the content image are not transferred in an aesthetically pleasing way.
When the background is removed, the pixels surrounding the head are a single color and can be transferred more completely into the shapes and colors of the style, which makes the whole picture look more harmonious.
The model's hyperparameters have been tuned for the best performance. It can now be tested by uploading a new content image and style image.
# load test content image
content_test_uploader = ImageUploader()
content_test_uploader.run()
# load test style image
style_test_uploader = ImageUploader()
style_test_uploader.run()
# check uploaded images
# display the images
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 20))
# content and style ims side-by-side
ax1.imshow(content_test_uploader.data[0])
ax1.set_title("Test Content Image")
ax2.imshow(style_test_uploader.data[0])
ax2.set_title("Test Style Image")
plt.show()
#remove test content image background
content_test_image = content_test_uploader.data[0]
style_test_image = style_test_uploader.data[0]
background_removal = BackgroundRemoval(weights_path="./weights/modnet.pth", device=device)
alpha = background_removal.gen_alpha(np.array(content_test_image))
result_data = []
render(
bgcolor='#000000',
fgcolor='#ffffff',
fg_fac=0,
bt_fac=1,
image = content_test_image
)
content_test_image = result_data[0]
#check the test content image with background removal
plt.imshow(content_test_image)
plt.title("Content Image with Background Removal")
# plot result of test images
content = load_image_uploader(content_test_image)
style = load_image_uploader(style_test_image)
contents_Y = extract_features(content, content_layers, style_layers)[0]
style_transfer(content, style, vgg,first_layer, second_layer, third_layer, fourth_layer, fifth_layer, style_weight, epochs)
The result on the new data shows a relatively large total loss, around 3543. This may be because the style image dominates the transferred image and the face is heavily distorted. Still, this can be an acceptable result, since the two photos are combined in a creative and aesthetically pleasing way.
Based on both the training outputs and the test outputs above, we can see that the target images combine the content features and style features well, indicating that the model can successfully apply the styles of famous paintings to your own selfie and produce entertaining, visually pleasing results.
After a number of runs on test images, our choices of background removal, content layer, style layers, style weights, content weight, and loss composition appear reasonable for this model.
Comparing the background-removed and non-background-removed target images, we see that the noisy background of the content image, after neural style transfer, is often distorted and loses its original meaning, since the arrangement of the background information is hard to capture when the content image consists mainly of the portrait.
For the content layer, the lower the layer, the more closely the generated image matches the content image. Since our purpose is a target image that keeps only the general arrangement of the human portrait, we choose conv4_2.
On the other hand, because we want to transfer both the global visual style and the texture details of the style image, the style layers are distributed over the whole of VGG-19 as the first layer of each convolutional block (conv1_1, conv2_1, conv3_1, conv4_1, conv5_1).
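These layer indices can be cross-checked without instantiating the model. The sketch below derives them from the VGG19 configuration list that torchvision uses (its 'E' variant), where each conv entry expands to a Conv2d plus a ReLU and each 'M' is a single MaxPool2d:

```python
# VGG19 configuration as used by torchvision ('E' variant)
cfg = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M',
       512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M']
idx, block, conv, names = 0, 1, 1, {}
for v in cfg:
    if v == 'M':            # MaxPool2d takes one index and ends the block
        idx += 1
        block += 1
        conv = 1
    else:                   # each conv entry -> Conv2d + ReLU (two indices)
        names[idx] = f'conv{block}_{conv}'
        idx += 2
        conv += 1
first_of_block = [i for i, n in names.items() if n.endswith('_1')]
print(first_of_block)  # [0, 5, 10, 19, 28] -- the style layers chosen above
# note: in this numbering, index 21 is conv4_2 and index 25 is conv4_4
```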
One of the most important hyperparameters to tune in this project is the set of style weights. More weight on the lower style layers produces larger style artifacts in the target image, while more weight on the higher style layers emphasizes feature details. We place a larger share of the weight on the lower style layers, because we found that too much weight on the higher layers blurs the facial expression of the portrait.
The total loss is the sum of the weighted content loss and style loss. After tuning, the content weight is simply set to 1, to keep the general arrangement of the human portrait, while the style weight is set to 1000 to transfer the visual style sufficiently.
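Plugging hypothetical loss values into the weighting shows how the beta = 1000 factor balances the two terms (the 2.5 and 0.19 values are made up for illustration, not measured from the model):

```python
# alpha = 1, beta = 1000, as tuned above
content_weight, style_weight = 1, 1000
content_loss, style_loss = 2.5, 0.19  # hypothetical per-epoch values
total_loss = content_weight * content_loss + style_weight * style_loss
print(total_loss)  # 192.5 -- the style term dominates despite its small raw value
```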
The model only does a good job combining the styles of landscape and still-life paintings; it is hard to transfer the style of a painted head portrait onto a real selfie with exactly matching facial layouts for each feature, such as the eyes and mouth.
To achieve that, we would need more complex models to extract features, such as a 3D facial mesh model, and more flexible control over the content and style features, such as using object detection to work on each part of the portrait individually.
If those approaches were implemented, we could also extend style transfer itself, for example by blending different styles into one content image: separate the objects of the content image into categories, then transfer a different style to each category.
On the other hand, the outputs above mostly capture the overall style, which is why the original colors of the content image are "washed away" and replaced by the style image's colors after a few thousand epochs. There is still much to explore in models that ignore the color of the style image and capture only its essence, such as the patterns and shapes of its lines. The output would then not only carry the transferred style but also keep the original colors. However, this idea may require enormous computing resources, since it is closer to a pixel-to-pixel operation.
All of these are possible future improvements for this project.
%%shell
jupyter nbconvert --to html /content/Team_17_Project_Walkthrough.ipynb